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Abstract. Real-data networks often appear to have strong modular- 
ity, or network-of-networks structure, in which subgraphs of various size 
and consistency occur. Finding the respective subgraph structure is of 
great importance, in particular for understanding the dynamics on these 
networks. Here we study modular networks using generalized method of 
maximum likelihood. We first demonstrate how the method works on 
computer-generated networks with the subgraphs of controlled connec- 
tion strengths and clustering. We then implement the algorithm which 
is based on weights of links and show it's efficiency in finding weighted 
subgraphs on fully connected graph and on real-data network of yeast. 
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1 Introduction 

Complex dynamical systems can be adequately represented by networks with a 
diversity of structural and dynamical characteristics, [1], [2] and [3]. Often such 
networks appear to have multiscale structure with subgraphs of different sizes 
and topological consistency. Some well known examples include gene modules 
on genetic networks [4], social community structures [5], topological clusters 
[6] or dynamical aggregation on the Internet, to mention only a few. It has 
been understood that in the evolving networks some functional units may have 
emerged as modules or communities, that can be topologically recognized by 
better or tighter connections. Finding such substructures is therefore of great 
importance primarily for understanding network's evolution and function. 

In recent years great attention has been devoted to the problem of commu- 
nity structure in social and other networks, where the community is topologi- 
cally defined as a subgraph of nodes with better connection among the members 
compared with the connections between the subgraphs, [5] and [7]. Variety of al- 
gorithms have been developed and tested, a comparative analysis of many such 
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algorithms can be found in [5]. Mostly such algorithms are based on the the- 
orem of maximal-flow-minimal-cut [8], where naturally, maximum topological 
flow falls on the links between the communities. Recently a new approach was 
proposed based on the maximum-likelihood method, [9]. In maximization like- 
lihood method an assumed mixture model is fit to a given data set. Assuming 
that the network nodes can be split into G groups, where group memberships 
are unknown, then the expectation-maximization algorithm is used in order to 
find maximum of the likelihood that suites the model. As a result a set of prob- 
abilities that a node belongs to a certain group are obtained. The probabilities 
corresponding to global maximum of the likelihood are expected to give the best 
split of the network into given number of groups. 

In complex dynamical networks, however, other types of substructures may 
occur, that arc not necessarily related to "better connectivity" measure. Gen- 
erally, the substructures may be diffcrentiable with respect certain functional 
(dynamical) constraints, such as limited path length (or cost), weighted sub- 
graphs, or subgraphs that are synchronizable at a given time scale. Search for 
such types of substructures may require new algorithms adapted to the respec- 
tive dynamical constraints. 

In this work we adapt the maximum-likelihood methods to study subgraphs 
with weighted links in real and computer-generated networks. We first intro- 
duce a new model to generate a network-of-networks with a controlled subgraph 
structure and implement the algorithm, [9], to test its limits and ability to find 
the a priori known substructures. We then generalize the algorithm to incorpo- 
rate the weights on the links and apply it to find the weighted subgraphs on a 
almost fully connected random graph with known weighted subgraphs and on a 
real-data network of yeast gene-expression correlations. 

2 Network of Networks: Growth Algorithm and Structure 

We introduce an algorithm for network growth with a controlled modularity. As 
a basis, we use the model with preferential attachment and preferential rewiring, 
[10], which captures the statistical features of the World Wide Web. Two param- 
eters a and a control the emergent structure of the Webgraph when the average 
number of links per node M is fixed. For instance, for M = 1 : when a < 1 the 
emergent structure is a scale-free clustered and correlated network, in particular 
the case a = a = 1/4 corresponds to the properties measured in the WWW [10]; 
when a = 1 a scale-free tree structure emerges with the exponents depending on 
the parameter a. Here we generalize the model in a nontrivial manner to per- 
mit development of distinct subnetworks or modules. The number of different 
groups of nodes is controlled by additional parameter P Q . Each subgroup evolves 
according to the rules of Webgraph. 

At each time step t we add a new node i and M new links. With probability 
P a new group is started. The added node is assigned current group index. 
(First node belong to the first group.) The group index plays a crucial role in 
linking the node to the rest of the network. The links are created by attaching 
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the added node inside the group, with probability <S, or else rewiring within the 
entire network. The target node k is selected preferentially with respect to the 
current situation in the group, which determines the linking probability Pi n (k, t). 
Similarly, the node which rewires the link n is selected according to its current 
number of outgoing links, which determines the probability p ou t(n,t): 



Pin 

(M) 



Ma + q in (k, t) 
tMa + L gk (t) 



Pout(n,t) 



Ma + q out {n,t) 
tM(a + l) 



(1) 



where qi n (k,t) and q ut(n,t) are in- or out-degrees of respective nodes at time 
step t, tM(a + 1) is number of links in whole network, while L gk (t) is number of 
link between nodes in a group of node k. It is assumed that qi n (i, i)=q ut(h i) = 0. 
Suggested rules of linking insure existence of modules in the network. Each group 
has a central node, hub, in terms of in-degree connectivity, and a set of nodes 
along which it is connected with the other groups. The number of groups G in 
the network depends on number of nodes, TV, and the parameter P Q as G ~ NP Q . 
Some emergent modular structures are shown in Fig. 1. For the purpose of this 
work we only mention that the networks grown using the above rules are scale- 
free with both in-coming and out-going links distributed as (/c="in" or "out"): 



P(Qk) 



(2) 



The scaling exponents rj n and r out vary with the parameters a and P Q . In Fig. 2 
we show cumulative distribution of in- and out-links in the case N=25000 nodes, 
M=5, a — a = 0.9 and number of groups G — 6. The slopes are Ti n — 1 = 
1.616 ± 0.006 and r out - 1 = 7.6 ± 0.3. 




Fig. 1. Network of networks generated by the algorithm described above for iV = 1000 
nodes and different combinations of control parameters a,& , P and M. Different colors 
represent topological subgraphs as found by the maximum-likelihood method. 



These networks with controlled modularity will be considered in the next Sec- 
tion to test the maximum-likelihood algorithm for finding subgraphs. In addition, 
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we will apply the method on a gene network, in which the modular structure is 
not known. The network is based on the empirical data of gene expressions for 
a set of 1216 cell-cycle genes of yeast measured at several points along the cell- 
cycle, selected from [11]. The pairs of genes are connected with weighted links 
according to their expression correlation coefficient. In such network, for the 
correlations exceeding a critical value Wq c a percolation-like transition occurs 
where functionally related clusters of genes join the giant cluster [12]. However, 
below that point the network is too dense and separation of the modules be- 
comes difficult. The topological betweenness-centrality measures for both, nodes 
and links in the gene network shown in Fig. 2 (right), exhibit broad distribution, 
suggesting a nontrivial topology of the network. 




Fig. 2. (left) Cumulative distribution of in-coming and out-going links for grown mod- 
ular network, (right) Distribution of betwenness-centrality for nodes and links in the 
gene-expression network of Yeast. 



3 Maximum Likelihood Methods for Weighted Graphs 

The method is based on a mixture model and numerical technique known as 
the expectation-maximization algorithm. We first describe the basic idea for 
unweighted networks, [9], and then generalize the method for weighted graphs. 

3.1 Theoretical Background 

A network of N nodes, directed or undirected, is represented mathematically 
by an adjacency matrix A with N x N elements. The elements Aij as (1,0), 
represent the presence/absence of a link between nodes. The idea is to construct 
a mixture model network partitioned into G groups, where members of the groups 
are similar in some sense and the numbers gi denote the group to which vertex 
i belongs [9], Group memberships are unknown, they are commonly referred as 
"hidden" data. The basic idea is to vary parameters of a suitable mixture model 
to find the best fit to the observed network. 
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The model parameters are: 9 r i, defined as the probability that a link from 
some node in group r connects to node i, and n r , representing the probability 
that a randomly chosen vertex falls in group r. The normalization conditions 

$>r = l, 5>„ = 1, (3) 

r i 

are required. The parameters can be estimated by the maximum likelihood crite- 
rion using expectation-maximization algorithm. In the present case the problem 
reduces to maximization of the likelihood Pr(A, g\n, 9) with respect to tt and 9. 
Using the factorization rule, we can write Pr(A, g\ir, 9) in the following form: 

Pr(A, g\n, 9) = Pr(A\g, tt, 9)Pr{g\v, 9) , (4) 

where 

Pr(A\g,n,e) = l[e^ , Pr{g\-K,6) = JJtt* . (5) 

ij i 

Combining Eqs. (4) and (5) we obtain 

Pr(A, .9M)=n^Ife-- ( 6 ) 

3 



It is common to use the logarithm of likelihood instead of the likelihood 
itself [9]. In addition, averaging of the log- likelihood over distribution of group 
memberships g is necessary with the distribution Pr(g\A, tt, 9), leading to 

G G 

1 = E ' ' ' E Pr ^ A > tt, 0) ]T [lnn gi + £ A vi ln0 :li , L (7) 

91 9n i 3 

= E qir^nitr + Aijln6 gi j] , 
ir j 

where qi r — Pr{gi — r\A, it, 9) represents the probability that node i belongs 
to group r. Using again the factorization rule for Pr(A,gi — r\ir, 9) in the case 
when A represents the missing data we find the expression for q ir 

Qir = Pr( 9i = r\A, n, 9) = ^ f= ^ = ■ W 

Pr(A\7r,9) Es^UjOf; 1 

Now we can use qi r given by the Eq. (8) to evaluate the expected value of 
log-likelihood and to find 7r r and 9 r i which maximize it. The maximization can 
be carried out analytically. Using the method of Lagrange multipliers to enforce 
the normalization conditions in (3), we find relations for the parameters 7r r and 

9ri' 

n Qout{j)qjr 
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In the numerical implementation, starting from an initial partitioning and it- 
erating the Eqs. (8) and (9) towards convergence, we determine the modular 
structure, which is defined by quantities qi r . In practice, the runtime to the 
convergence depends on the number of nodes and number of groups, G. 

In practical calculations, the clustering algorithm converges to one out of 
many local maxima, which is sensitive to the initial conditions. Hence the choice 
of initial values of the model parameters is a not trivial step. The obvious unbi- 
ased choice of the initial values is the symmetric point with 7r° = - and 6% = — , 

which is consistent with normalization conditions for numbers 7r° and 6°. Unfor- 
tunately, this represents a trivial fixed point of the iterations. Instead, we find 
that the starting values that are perturbed randomly at small distance from the 
fixed point are better in terms of convergence of algorithm to right maxima of 
expected value of log-likelihood. After initialization of the parameters, we com- 
pute q%. using Eq. (8) and then that irj. and Q x ri according to Eq. (9), etc. After 
some number of iterative steps, the algorithm will converge to a local maximum 
of the likelihood. In order to find the global maximum, it is recommendable to 
perform several runs with different initial conditions. 

We test the the MLM algorithm on the networks grown by the model 
represented in Section 2, for a wide range values of parameters M, Pq, a and 
a. As we expected, the algorithm works well on networks in which the modules 
with respect to connectivity — most of the links are between vertices inside the 
group — arc well defined, which is the case for large parameter a. The partitions 
are shown in Fig. 1: Clusters of the network found by the algorithm corresponds 
perfectly to the division derived from the growth. Group membership suggested 
by the algorithm and the original one are the same for the 98% of the nodes 
in the network. The algorithm even finds the "connector" nodes, that play a 
special role in each group. However, it is less efficient for the networks with 
large density of links between groups, as for instance for networks with a « 0.6 
or lower. Some other observations: Number and size of different groups do not 
affect the efficiency of the algorithm. Size of the network and sparseness of the 
network affect the convergence time. As a weak point, the number of groups G 
is an input parameter. 

3.2 Generalization of the Algorithm for Weighted Networks 

The topological structure of networks, usually expressed by the presence or ab- 
sence of links, can be considerably altered when links or nodes aquire different 
weights. Here we modify the algorithm presented in Section 3.1 in order to take 
into account weighted networks with modular structure. The main idea is based 
on the fact that a weighed link between a pair of nodes on the network can 
be considered as multiple links between that pair of nodes. Then a straightfor- 
ward generalization of the MLM is to apply the mixture model and expectation- 
maximization algorithm described above to the multigraph constructed with the 
appropriate number of links between pairs of nodes. The quantities to be consid- 
ered are: measured matrix of weights, gt missing data, and model parameters 
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{n r ,6 r i}. Following the same steps as above leads to the following expressions 
which are relevant for the algorithm: 



(10) 



nr = ^121L , ^ = -J - , (ii) 
n ^ ljq jr 

where lj — J2i Wj» is the summation over all weights of links emanating from 
the node j. The implementation of the algorithm and the choice of initial values 
of parameters are illustrated in previous chapter. Although the formal analogy 
between the expressions in Eqs. (9) and (11) and also between Eqs. (8) and (10) 
occurs with W,- L j — > , the important difference occurs in the quantity lj in Eq. 
(11), which is a measure of strength of node rather than its connectivity. There- 
fore, within the weighted algorithm, nodes of same strength appear to belong 
to the same community. The weighted communities may have important effects 
in the dynamics, as for instance, the clusters of nodes with same strength tend 
to synchronize at the same time scale. Such properties of the weighted networks 
remain elusive for the classical community structure analysis based on max-min 
theorem, mentioned in the introduction. 

In the remaining part of the Section we apply the algorithm to two weighted 
networks: first we demonstrate how it finds weighted subgraphs on a computer 
generated random graph with large density of links, and then on a gene expres- 
sion network of yeast generated from the empirical expression data. The results 
are given in Fig. 3. A random graph of N = 100 nodes with a link between each 
pair of nodes occurring with probability p — 0.5 is generated. In this homoge- 
neously connected graph we create G = 4 groups and assign different weights 
to links in each group of nodes. In the network of yeast gene expressions, as 
described in Section 2, the weights of links appear through the correlation co- 
efficient of the gene expressions. As the Fig. 3 shows, the algorithm retrieves 
accurately four a-priori known weighted groups of nodes on the random graph. 
Similarly, in the case of gene networks, we tried several partitiones with the po- 
tentially different number of groups. One such partition with G = 5 groups is 
shown in Fig. 3. Each weighted group on the gene network actually represents a 
set of genes which are closely co-expressed during the entire cell cycle. 



4 Conclusions 

We have extended the maximum-likelihood-method of community analysis to 
incorporate multigraphs (wMLM) and analysed several types of networks with 
mesoscopic inhomogeneity. Our results show that the extended wMLM can be 
efficiently applied to search for variety of subgraphs from a clear topological 
inhomogeneity with network-of-networks structure, on one end, to hidden sub- 
graphs of node with the same strength on the other. 
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Fig. 3. Weighted clusters found by the extended ML algorithm in the correlation net- 
work of gene expressions of yeast (left), and in the weighted random graph (right). 
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